Item Characteristic Curves generated from common CTT Item Statistics


Diego Figueiras¹, John Kulas¹

¹ Montclair State University

Introduction

Item characteristic curves (ICCs) are frequently referenced by psychometricians as visual indicators of important attributes of assessment items, most frequently difficulty and discrimination. Assessment specialists who examine ICCs usually do so from within the psychometric framework of either Item Response Theory (IRT) or Rasch modeling. These frameworks provide the parameters necessary to plot the ogive functions. If the curve transitions from low to high likelihood at a location toward the lower end of the trait continuum (e.g., "left" on the plotting surface), the item is relatively easy to answer correctly. If the curve is steep (e.g., strongly vertical), the item discriminates well; if it is flatter, it discriminates more poorly (see Figure 1).
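The two-parameter logistic (2PL) ogive underlying these curves can be sketched in a few lines. This is a minimal illustration; the function name and the example parameter values are ours, not taken from the paper:

```python
import math

def icc_2pl(theta, a, b):
    """2PL item characteristic curve: probability of a correct response
    at trait level theta, given discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# A lower b shifts the curve left (an easier item):
p_easy = icc_2pl(0.0, a=1.0, b=-1.0)   # well above .5
p_hard = icc_2pl(0.0, a=1.0, b=1.0)    # well below .5

# A larger a makes the curve steeper (better discrimination):
p_steep = icc_2pl(0.5, a=2.5, b=0.0)
p_flat  = icc_2pl(0.5, a=0.5, b=0.0)   # closer to .5 than p_steep
```

At theta = b the curve always passes through .5, which is why b is read off the plotting surface as the item's location.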


Figure 1: Item characteristic curves reflecting visual differences in difficulty and discrimination.

From a Classical Test Theory (CTT) orientation, item difficulty is most commonly represented by the proportion of individuals answering the item correctly (also referred to as the item's p-value). Item discrimination can be conveyed via a few different CTT indices, but the most commonly calculated and consulted index is the corrected item-total correlation.
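These two CTT statistics can be sketched directly from a person-by-item response matrix. The tiny matrix below is invented purely for illustration:

```python
def p_values(responses):
    """Proportion correct for each item (columns) over persons (rows)."""
    n = len(responses)
    k = len(responses[0])
    return [sum(row[j] for row in responses) / n for j in range(k)]

def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def corrected_item_total(responses, j):
    """Correlate item j with the total score *excluding* item j."""
    item = [row[j] for row in responses]
    rest = [sum(row) - row[j] for row in responses]
    return pearson(item, rest)

# Invented 5-person x 3-item example:
X = [[1, 1, 0],
     [1, 0, 0],
     [1, 1, 1],
     [0, 1, 0],
     [1, 1, 1]]
difficulties = p_values(X)              # [0.8, 0.8, 0.4]
r_item3 = corrected_item_total(X, 2)
```

The "corrected" part matters: correlating an item with a total that includes itself inflates the discrimination estimate, especially on short tests.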

Method

We simulated data using Han's (2007) WinGen software. The sample comprised 10,000 observations drawn from a distribution with a mean of 0 and a standard deviation of 1. There were 100 items, each scored dichotomously (1 = correct, 0 = incorrect). The simulated a-parameters had a mean of 2 and a standard deviation of 0.8; the b-parameters had a mean of 0 and a standard deviation of 0.5. The mirt package (Chalmers, 2021) was used to estimate the IRT a-parameters and to plot the resulting 2PL model. For the CTT-derived a-parameter, we used the modification of Lord's (2012) approximation described below, along with a re-scaling of the p-values.

\[a_i\cong \frac{r_i}{\sqrt{1-r_i^2}}\]

\[\hat{a_i}\cong[(.51 + .02z_g + .3z_g^2)r]+[(.57 - .009z_g + .19z_g^2)\frac{e^r-e^{-r}}{e-e^r}]\]
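Lord's classic approximation (the first formula above) is straightforward to compute; the correlation values below are invented for illustration:

```python
def lord_a(r):
    """Approximate IRT discrimination from a CTT item-total correlation r,
    via Lord's a ~= r / sqrt(1 - r^2)."""
    return r / (1.0 - r ** 2) ** 0.5

# Higher item-total correlations imply steeper (more discriminating) ICCs:
a_modest = lord_a(0.3)
a_steep = lord_a(0.8)
```

Note that the approximation diverges as r approaches 1, so in practice it is applied to the moderate correlations typical of real item-total statistics.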

We additionally re-scaled the CTT difficulty estimates so that they were on the same scale as the IRT estimates. This was done by fitting a regression model in which the CTT estimate predicted the corresponding IRT parameter; the fitted values from this model were used when plotting the CTT-derived ICCs.
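The re-scaling step can be sketched as a simple least-squares fit. The paired values below are invented stand-ins for the CTT difficulty estimates and IRT b-parameters:

```python
def ols(x, y):
    """Slope and intercept of the least-squares line predicting y from x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

# Invented CTT p-values and matching IRT b-parameters
# (easier items have higher p-values but *lower* b):
ctt = [0.10, 0.35, 0.50, 0.70, 0.90]
irt_b = [1.6, 0.6, 0.0, -0.7, -1.5]

slope, intercept = ols(ctt, irt_b)
rescaled = [slope * v + intercept for v in ctt]  # now on the IRT b scale
```

The negative slope reflects the inverse orientation of the two scales: a high proportion correct corresponds to a low (easy) b-parameter.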

Results

We calculated the area between each pair of CTT-derived and IRT-derived ICCs. The average area across all 100 items was 0.35, and the distribution of areas was positively skewed: most of the 100 items had an area between curves of less than 0.35. Figure 2 plots all 100 ICCs generated from CTT parameters, and Figure 3 plots the same items using IRT parameters. The curves produced by the two methodologies are very similar in shape and form, as illustrated by the two items highlighted in each figure.
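An area-between-curves statistic of this kind can be approximated by trapezoidal integration over a theta grid. This is a sketch under our own assumptions (a 2PL form for both curves, invented parameter values, and an integration range of [-4, 4]), not the paper's exact computation:

```python
import math

def icc_2pl(theta, a, b):
    """2PL item characteristic curve."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def area_between(a1, b1, a2, b2, lo=-4.0, hi=4.0, n=801):
    """Trapezoidal approximation of the absolute area between
    two 2PL ICCs over the interval [lo, hi]."""
    step = (hi - lo) / (n - 1)
    prev = abs(icc_2pl(lo, a1, b1) - icc_2pl(lo, a2, b2))
    total = 0.0
    for i in range(1, n):
        t = lo + i * step
        cur = abs(icc_2pl(t, a1, b1) - icc_2pl(t, a2, b2))
        total += (prev + cur) / 2.0 * step
        prev = cur
    return total

# Identical parameters give zero area; shifting b opens the area up:
same = area_between(1.0, 0.0, 1.0, 0.0)    # 0.0
shifted = area_between(1.0, 0.0, 1.0, 0.5)  # slightly under the b-shift of 0.5
```

When the two curves share a discrimination parameter, the exact area over the full real line equals the absolute difference in b, which makes the statistic easy to interpret as a difficulty discrepancy.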


Figure 2: ICCs derived from only CTT parameters (with two noteworthy ICCs annotated).


Figure 3: Typical ICCs derived from IRT parameters (same noteworthy items annotated).

Discussion

Large-scale data, truly random sampling, and items spanning a wide difficulty range can yield comparable CTT item and person statistics across testing populations and occasions (Kulas et al., 2017). Fan (1998) examined the correlations between CTT and IRT ability estimates and item difficulty indices across the one-, two-, and three-parameter IRT models. These correlations were very high, generally between .80 and .90, indicating substantial overlap between the two methodologies. For item discrimination, correlations were moderate to high, with only a few very low values. Kulas et al. (2017), however, provide an adjustment to Lord's (2012) formula specifying the functional relationship between the "non-invariant" CTT and "invariant" IRT discrimination statistics.

References

Chalmers, P. (2021). mirt: Multidimensional item response theory. https://CRAN.R-project.org/package=mirt

Fan, X. (1998). Item response theory and classical test theory: An empirical comparison of their item/person statistics. Educational and Psychological Measurement, 58(3), 357–381.

Han, K. T. (2007). WinGen: Windows software that generates item response theory parameters and item responses. Applied Psychological Measurement, 31(5), 457–459.

Kulas, J. T., Smith, J. A., & Xu, H. (2017). Approximate functional relationship between IRT and CTT item discrimination indices: A simulation, validation, and practical extension of Lord's (1980) formula. Journal of Applied Measurement, 18(4), 393–407.

Lord, F. M. (2012). Applications of item response theory to practical testing problems. Routledge.